Assignment A2 - Visualization

Course "Data processing and Visualization", IE500417, NTNU. Fall 2020

https://www.ntnu.edu/studies/courses/IE500417

Note: plagiarism is strictly forbidden! You should never copy any source code from other students. If you use any code written by others (except the standards libraries: NumPy, SciPy, Pandas, etc), provide a reference.

If the teachers see that your work is mostly copy+paste from online code snippets, the grade can be reduced.

If a case of plagiarism is detected, it will be reported to the administration.

Task description

The goal of this assignment is to get familiar with visualization options.

You can choose whichever visualization tools work best for you. Suggested toolset: either Plotly or Matplotlib. Check the notebooks on Blackboard Week 02 with examples.

Submission details (Same as A1)

The assignment must be handed in on Blackboard. The following must be handed in:

  1. Report in HTML or PDF format describing the results of this assignment. Preferably, it is generated from the Jupyter notebook you used (Hint: In Jupyter: File > Download as > HTML). Alternatively (if you use plain Python or other tools), prepare a readable report that contains figures and source code snippets necessary to understand your work.
  2. Source code that you used to generate the results. This could be the the Jupyter notebook file, python source files, Matlab files, etc.

Deadlines and grading information on Blackboard.

Detailed steps

First the imports

In [1]:
import pandas as pd

Now do the imports necessary for your visualization tools (MatPlotLib, Plotly, or whatever you use)

TASK FOR YOU:

In [2]:
# Your code here - do the imports

# Import plotly chart types
# import plotly.graph_objs as go
import plotly.graph_objects as go

# We will want to work offline, without Plotly cloud services
# import plotly.offline as ploff
# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# This tells Jupyter notebook to download all necessary Javascripts from Plotly so that 
# it can continue to work offline from now on
# init_notebook_mode(connected=True)

# Reference: https://plotly.com/python/renderers/#interactive-renderers
# import plotly.io as pio
# pio.renderers

We will use data set on olympic game results. The data set comes from Kaggle. Hint: you can look at different data processing solutions by other people there. Just remember that this work is your individual contribution!

In [3]:
olympics = pd.read_csv("athlete_events.csv")
olympics.head()
Out[3]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN

Task 1: Inspect the columns

Always when you work with data, it is important to find out first what kind of data are you working with. How many rows are there, how many columns? What are column types, are there any empty values?

One thing that you can always do: get quick statistics using the .describe() method. That gives the first impression of the values.

In [4]:
olympics.describe()
Out[4]:
ID Age Height Weight Year
count 271116.000000 261642.000000 210945.000000 208241.000000 271116.000000
mean 68248.954396 25.556898 175.338970 70.702393 1978.378480
std 39022.286345 6.393561 10.518462 14.348020 29.877632
min 1.000000 10.000000 127.000000 25.000000 1896.000000
25% 34643.000000 21.000000 168.000000 60.000000 1960.000000
50% 68205.000000 24.000000 175.000000 70.000000 1988.000000
75% 102097.250000 28.000000 183.000000 79.000000 2002.000000
max 135571.000000 97.000000 226.000000 214.000000 2016.000000

Another typical issue: missing values. In the snapshot above we see that some cells have value NaN. It means "not a number", i.e, the value is missing.

Pandas DataFrame has method .isnull(). Use it to find out which columns have empty values.

TASK FOR YOU:

In [5]:
# Your code here

# List of columns with missing values
# Reference: https://stackoverflow.com/questions/37366717/pandas-print-column-name-with-missing-values
olympics.columns[olympics.isnull().any()].tolist()
Out[5]:
['Age', 'Height', 'Weight', 'Medal']

Your answer here (in case you need an explanatory comment in addition to code)

...

Task 2: Clarify attribute types

Before you visualize data, you should check what types do the attributes have. Your task: inspect all the columns and for each of them clarify: is this attibute quantitative, ordinal or categorical?

Store the names of all categorical attributes in a list and call it categories. In a similar fashion, store ordinal attribute names in a variable named ordinals and quantitative attribute names in quants.

For example, if you think that attributes age and height are categorical while year and team are quantitative, your variables should be:

categories = ["Age", "Height"]
quants = ["Year", "Team"]
ordinals = []

TASK FOR YOU:

In [6]:
# Your code here
categories = ["Sex", "Team", "NOC", "Season", "City", "Sport", "Event", "Year", "ID", "Name"]
quants = ["Age", "Height", "Weight"]
ordinals = ["Games", "Medal"]

Task 3: Display Nordic gold

In this step you will visualize gold medals for Nordic countries. First, we select data for each country (this is just one option, there are many ways to do it).

In [7]:
# Comment: This returns a filter: a list with boolean (true/false) values for each row: true if the country code is "NOR":
# olympics["NOC"] == "NOR"
# This selects only those rows from the dataframe, where the filter says "True" in the corresponding row:
# olympics[olympics["NOC"] == "NOR"]
# etc...

nor = olympics[olympics["NOC"] == "NOR"]
swe = olympics[olympics["NOC"] == "SWE"]
den = olympics[olympics["NOC"] == "DEN"]
fin = olympics[olympics["NOC"] == "FIN"]
isl = olympics[olympics["NOC"] == "ISL"]

Task 3.1: Visualize total number of gold medals per Nordic country, in a single plot.

Use a single figure that shows these two attributes: a country and number of medals (for a particular country). When you choose the type of chart, keep in mind the attribute types that you are visualizing!

TASK FOR YOU:

In [8]:
# Your code here

# Filter Nordic countries
nordic = olympics[olympics["NOC"].isin(["NOR", "SWE", "DEN", "FIN", "ISL"])]

# Filter Nordic Gold medal winners
nordic_gold_medals =  nordic[nordic["Medal"] == "Gold"]

noc_grouping = nordic_gold_medals.groupby("NOC")
gold_medals_count = noc_grouping.count()["Medal"]

# To account for 0 gold medals won by a country
# Reference: https://datascience.stackexchange.com/questions/18154/how-to-count-categorical-values-including-zero-occurrence
gold_medals_count = gold_medals_count.reindex(nordic["NOC"].unique().tolist(), fill_value=0)

fig = go.Figure([go.Bar(x=gold_medals_count.index, y=gold_medals_count.values, marker_color="gold")])

# Change the bar order
fig.update_layout(xaxis={'categoryorder':'total descending'})

fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Number of gold medals")

fig.show()

The next task is to show one more attribute: Season. You should show the total number of medals per Nordic country, but separated by season. I.e, how many medals has Norway got in Summer Olympics, how many in Winter olympics; how many gold medals has Sweden got in Winter Olympics, how many in Summer, etc. Think about the "Season" attribute - what type does it have and what channel would be appropriate to visualize it?

Task 3.2: Visualize total number of gold medals per Nordic country, per season.

TASK FOR YOU:

In [9]:
# Your code here

# Filter Nordic Gold medal winners by season
noc_season_grouping = nordic_gold_medals.groupby(["NOC", "Season"])
seasonal_gold_medals_count = noc_season_grouping.count()["Medal"]

# To account for 0 gold medals won by a country
# Reference: https://datascience.stackexchange.com/questions/18154/how-to-count-categorical-values-including-zero-occurrence
# Reference: https://cmdlinetips.com/2020/05/fun-with-pandas-groupby-aggregate-multi-index-and-unstack/
iterables = [nordic["NOC"].unique().tolist(),  nordic["Season"].unique().tolist()]
index = pd.MultiIndex.from_product(iterables)
seasonal_gold_medals_count_unstacked = seasonal_gold_medals_count.reindex(index, fill_value=0).unstack()

# To match previous bar chart
# Reference: https://stackoverflow.com/questions/17156662/pandas-dataframe-unstack-changes-order-of-row-and-column-headers
seasonal_gold_medals_count = seasonal_gold_medals_count_unstacked.reindex(gold_medals_count.sort_values(ascending=False).index)

x_axis = seasonal_gold_medals_count.index
fig = go.Figure(go.Bar(x=x_axis, y=seasonal_gold_medals_count["Summer"].tolist(), name='Summer'))
fig.add_trace(go.Bar(x=x_axis, y=seasonal_gold_medals_count["Winter"].tolist(), name='Winter'))

# Change the bar mode
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total descending'})

fig.update_xaxes(title_text="Country")
fig.update_yaxes(title_text="Number of gold medals")

fig.show()

Task 3.3: Reflect/explain: why did you choose these chart types? What marks and channels are they using to convey the information of each attribute?

TASK FOR YOU:

Your reflection here

I chose bar chart to compare the relative performance of the Nordic countries. I used a stacked bar chart to visualize the variation of gold medals by season in proportion to the total gold medals. I aligned the x-axes of the bar charts for easy total versus seasonal comparison.

Marks: Lines
Channels: Height for number of gold medals, Colors for seasons

Task 4: Visualize trend

Visualize number of women participating in the Olympics (with or without medals) over the years in a single chart.

TASK FOR YOU:

In [10]:
# Your code here

# Number of women for every olympic year
women_olympians = olympics[olympics["Sex"] == "F"].groupby("Year").count()["ID"]

fig = go.Figure(data=go.Scatter(x=women_olympians.index, y=women_olympians.values, mode="lines+markers"))

fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Number of women")

fig.show()

TASK FOR YOU: The chart probably will have som weird spikes towards the end. Do some analysis to understand what it means and why does it appear.

Your reflection here

The reason for spikes towards the end is that both the Summer and the Winter Olympic Games were held in the same year until 1992; after that, beginning with the 1994 Games, the Winter Olympics were held every four years, two years after each Summer Olympics.

TASK FOR YOU: Find a reasonable way to aggregate the data to get the overall trend over the years. I.e., get rid of the spikes.

In [11]:
# Your code here

# Number of women for every olympic year per season
seasonal_women_olympians =  olympics[olympics["Sex"] == "F"].groupby(["Year", "Season"]).count()["ID"].unstack()

summer_women_olympians = seasonal_women_olympians["Summer"].dropna()
winter_women_olympians = seasonal_women_olympians["Winter"].dropna()

fig = go.Figure(data=[
    go.Scatter(name="Summer", x=summer_women_olympians.index, y=summer_women_olympians.values, mode="lines+markers"),
    go.Scatter(name="Winter", x=winter_women_olympians.index, y=winter_women_olympians.values, mode="lines+markers")
])

fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Number of women")

fig.show()

Your reflection here - Explain why your approach makes sense.

I chose to plot the trends for the Summer and the Winter Olympic Games separately. Since they are not held in the same year after 1992, it was sensible to take this approach to visualize the trends.

Task 5: Visualize statistics

Sometimes you want to look at distribution of values for an attribute. There are several choices. Histogram is one of them. Idea of histogram: we split the value range into intervals of specific size and count how many attributes have value that fits within each interval.

Task 5.1: Show histogram of weight for all gold medal winners in the WINTER season.

TASK FOR YOU:

In [12]:
# Your code here

# Filter number of gold medals in Winter Olympics
winter_gold_medal_filter = (olympics["Season"] == "Winter") & (olympics["Medal"] == "Gold")

winter_gold_medals = olympics[winter_gold_medal_filter]

fig = go.Figure(data=[go.Histogram(x=winter_gold_medals["Weight"])])

fig.update_xaxes(title_text="Weight")
fig.update_yaxes(title_text="Number of gold medals")

fig.show()

Task 5.2: Show histogram of weight for all gold medal winners in the SUMMER season. Note: the histogram should be comparable with the WINTER histogram.

TASK FOR YOU:

In [13]:
# Your code here

# Filter number of gold medals in Summer Olympics
summer_gold_medal_filter = (olympics["Season"] == "Summer") & (olympics["Medal"] == "Gold")

summer_gold_medals = olympics[summer_gold_medal_filter]

fig = go.Figure(data=[go.Histogram(
    x=summer_gold_medals["Weight"],
    xbins=dict( # bins used for histogram
        size=2 # To make this comparable with the WINTER histogram
    ))])

fig.update_xaxes(title_text="Weight")
fig.update_yaxes(title_text="Number of gold medals")

fig.show()

Task 6: Visualize trend of statistics (a bit more challenging)

In this task you will visualize how one attribute varies over time. But instead of visualizing a single value, we will look at the statistics.

First, we calculate Body Mass Index for all participants. Formula: BMI = Weight / (Height * Height)

In [14]:
olympics["BMI"] = olympics["Weight"] / (olympics["Height"] ** 2)

Your task is to group the entries by year, for each year show the statistics of BMI: the mean BMI value and the range of values. There are different options available, including:

  • Show a box plot for each year
  • Show Bollinger Bands (R)" for the BMI variable. Bollinger Bands is typically used in finance. But it uses statistical concepts in the core: a moving average line shown in the middle, upper and lower bands shown as the mean +/- two standard deviations.
  • Be creative! :)

Note: the idea is to show this in a single plot. Don't create a separate box plot for every year.

Some inspirational examples from a Kaggle solution by Marco Giuseppe de Pinto):

TASK FOR YOU:

In [15]:
# Your code here

years = olympics["Year"].sort_values().unique()

fig = go.Figure()

for year in years:
    fig.add_trace(go.Box(
          y=olympics[olympics["Year"] == year]["BMI"].dropna(),
          name=int(year),
          boxmean=True,
          jitter=0.5,
          whiskerwidth=0.2,
          marker_size=2,
          line_width=1)
      )
    
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="BMI")

fig.show()

Reflection

Please reflect on the following questions:

  1. How did the assignment go? Was it easy or hard?
    The assignment went well. It was not very hard.

  2. How many hours did you spend on it?
    8-10 hours.

  3. What was the most time-consuming part?
    Task 3.2.

  4. If you need to do similar things later in your professional life, how can you improve? How can you do it more efficiently?
    Go through the examples and documentation, understand the task at hand before implementation. Look for good quality resources.

  5. Was tehre something you would expect to learn that this exercise did not include?
    Use Matplotlib as I used Plotly in this assignment.

  6. Was there something that does not make sense?
    No.

In [ ]: